feat: enhanced scientific RAG pipeline for research workflows (ISAAC-497) #23
Open
watcharaponthod-code wants to merge 1 commit into
…497)

Implement section-aware document ingestion, multi-query retrieval with reciprocal rank fusion, stable citation keys, and budget-capped evidence context for the scientific/research RAG workflow.

Changes:
- Add ui/utils/server/scientific-rag.ts: core RAG utilities
  * detectScientificSection: identifies abstract/methods/results/etc from chunk text
  * sectionWeight: importance weights (abstract 1.4x, results 1.3x, methods 1.2x...)
  * buildChunkMetadata: typed metadata with stable citationKey, strips temp paths
  * buildResearchQueries: expands query into 4 deterministic variants for recall
  * fuseQueryResults: RRF + section-weighted deduplication across query result sets
  * buildEvidencePayload: budget-capped evidence context + source manifest
  * parseBoundedInteger: safe integer parsing for API params
  * SCIENTIFIC_SEPARATORS: section-heading-first text splitter separators
- Update ui/pages/api/inject-documents.ts:
  * Use SCIENTIFIC_SEPARATORS for section-aligned chunking (900 char chunks)
  * Replace processDocuments with buildChunkMetadata for typed, safe metadata
  * Store citationKey, section, sectionWeight in ChromaDB for downstream ranking
- Update ui/pages/api/fetch-documents.ts:
  * Expand query with buildResearchQueries before Chroma lookup
  * Apply fuseQueryResults RRF + section-weight fusion across all variants
  * Return structured evidence payload instead of raw Chroma response
- Update ui/pages/api/rag-chat.ts:
  * Use same-origin URL for fetch-documents (works in any deployment)
  * Scientific research assistant system prompt with strict citation rules
  * Use temperature from request instead of hard-coded 0
  * Section-prioritised citation rules in prompt (prefer Results > Methods > Abstract)
- Add ui/__tests__/scientific-rag.test.ts: 30 tests covering all public helpers

Validation:
- npx vitest run (30/30 passed)
- npx tsc --noEmit (0 errors)

Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
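As a rough illustration of the two smaller helpers named in the commit message, the sketch below shows one way bounded integer parsing and a budget-capped evidence context could work. It is not the PR's code: the names parseBounded and capEvidence, the parameter order, the character-based budget, and the "[key] text" entry layout are all assumptions made for this example.

```typescript
// Hypothetical sketch; the PR's parseBoundedInteger and buildEvidencePayload may differ.

// Parse an API parameter as an integer, falling back on bad input and
// clamping the result into [min, max].
function parseBounded(raw: unknown, fallback: number, min: number, max: number): number {
  // Query parameters may arrive as string arrays; use the first value.
  const value = Array.isArray(raw) ? raw[0] : raw;
  if (typeof value !== "string" && typeof value !== "number") return fallback;
  const parsed = typeof value === "number" ? Math.trunc(value) : Number.parseInt(value, 10);
  if (!Number.isFinite(parsed)) return fallback;
  return Math.min(max, Math.max(min, parsed));
}

interface Evidence {
  citationKey: string;
  text: string;
}

// Append ranked chunks until the next one would exceed the budget.
// Assumption: the budget is measured in characters of the joined context.
function capEvidence(
  chunks: Evidence[],
  maxChars: number
): { context: string; sources: string[] } {
  const parts: string[] = [];
  const sources: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const entry = `[${chunk.citationKey}] ${chunk.text}`;
    // Account for the two-character separator between entries.
    const cost = entry.length + (parts.length > 0 ? 2 : 0);
    if (used + cost > maxChars) break;
    parts.push(entry);
    sources.push(chunk.citationKey);
    used += cost;
  }
  return { context: parts.join("\n\n"), sources };
}
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the context in rank order; whether the PR makes the same choice is not stated.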
Bounty: ISAAC-497
Algora bounty: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
Summary
This PR implements an enhanced RAG pipeline for scientific and research document workflows. Five files changed, including a new dedicated utility module whose public helper functions are covered by 30 tests.
Key improvements
Section-aware chunking: SCIENTIFIC_SEPARATORS splits documents at Abstract/Methods/Results boundaries. Every chunk stores citationKey, section, and sectionWeight in ChromaDB.
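A minimal sketch of how section detection and weighting might look, assuming a chunk's section is decided by a heading on its first non-empty line. The weights are the ones quoted in this PR; the regular expressions, the names detectSection and weightFor, and the restriction to four sections are assumptions for illustration only.

```typescript
// Hypothetical sketch; the PR's detectScientificSection and sectionWeight may differ.
type Section = "abstract" | "methods" | "results" | "body";

// Weights quoted in the PR description; sections it does not list are omitted here.
const SECTION_WEIGHTS: Record<Section, number> = {
  abstract: 1.4,
  results: 1.3,
  methods: 1.2,
  body: 0.8,
};

// Assumed heading shapes: optional markdown hashes or a section number,
// then the heading word alone on the line.
const HEADING_PATTERNS: Array<[Section, RegExp]> = [
  ["abstract", /^\s*(?:#+\s*)?abstract\s*:?\s*$/i],
  ["methods", /^\s*(?:#+\s*)?(?:\d+\.?\s*)?(?:materials\s+and\s+)?methods?\s*:?\s*$/i],
  ["results", /^\s*(?:#+\s*)?(?:\d+\.?\s*)?results?\s*:?\s*$/i],
];

function detectSection(chunkText: string): Section {
  // Only the first non-empty line is inspected, so prose that merely
  // mentions "results" mid-sentence falls through to "body".
  const lines = chunkText.split(/\r?\n/);
  const firstLine = lines.find((line) => line.trim().length > 0) ?? "";
  for (const [section, pattern] of HEADING_PATTERNS) {
    if (pattern.test(firstLine)) return section;
  }
  return "body";
}

function weightFor(section: Section): number {
  return SECTION_WEIGHTS[section];
}
```

Because the splitter is described as section-heading-first, chunks should tend to begin at a heading, which is what makes a first-line check like this plausible.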
Multi-query retrieval with RRF + section weighting: buildResearchQueries expands the user query into 4 deterministic variants. fuseQueryResults applies Reciprocal Rank Fusion with section importance weights (abstract 1.4x, results 1.3x, methods 1.2x, body 0.8x). Duplicate chunks accumulate scores across query variants.
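The fusion step can be sketched as follows. Reciprocal Rank Fusion gives each chunk 1 / (k + rank) per result list it appears in and sums those contributions. Two details here are assumptions rather than facts about this PR: the constant k = 60 (a conventional default) and the choice to multiply each contribution by the section weight. The type and function names are hypothetical.

```typescript
// Hypothetical sketch; the PR's fuseQueryResults may combine weights differently.
interface RankedChunk {
  id: string;            // stable chunk identifier, e.g. a citation key
  sectionWeight: number; // e.g. 1.4 for abstract, 0.8 for body
}

interface FusedChunk extends RankedChunk {
  score: number;
}

const RRF_K = 60; // assumption: conventional RRF smoothing constant

// Each inner array is the ranked result list for one query variant, best first.
function fuseRankings(resultSets: RankedChunk[][], k: number = RRF_K): FusedChunk[] {
  const fused = new Map<string, FusedChunk>();
  for (const results of resultSets) {
    results.forEach((chunk, index) => {
      const rank = index + 1; // RRF ranks are 1-based
      const contribution = chunk.sectionWeight / (k + rank);
      const existing = fused.get(chunk.id);
      if (existing) {
        // Duplicate across variants: accumulate instead of adding a second entry.
        existing.score += contribution;
      } else {
        fused.set(chunk.id, { ...chunk, score: contribution });
      }
    });
  }
  return Array.from(fused.values()).sort((a, b) => b.score - a.score);
}
```

With these assumptions, a Results chunk ranked first by one variant and second by another scores 1.3 × (1/61 + 1/62), so it outranks a body chunk holding the same two positions.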
Stable citation keys: buildCitationKey produces deterministic title-slug:pPage:cChunk+1 keys. buildChunkMetadata strips server-side temp upload paths from the public source field.
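For illustration, a key of the form title-slug:pPage:cChunk+1 could be produced as below. The slug rule (lowercase, non-alphanumeric runs collapsed to hyphens), the "untitled" fallback, and the basename-only treatment of upload paths are assumptions; the function names are hypothetical and do not mirror the PR's signatures.

```typescript
// Hypothetical sketch; the PR's buildCitationKey and buildChunkMetadata may differ.

// Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen,
// leading and trailing hyphens trimmed.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// chunkIndex is zero-based; the key shows it one-based, matching "cChunk+1".
function citationKey(title: string, page: number, chunkIndex: number): string {
  const slug = slugify(title) || "untitled"; // assumed fallback for empty titles
  return `${slug}:p${page}:c${chunkIndex + 1}`;
}

// Assumed way to hide a server-side temp directory: keep only the file name.
function publicSource(uploadPath: string): string {
  const parts = uploadPath.split(/[\\/]/);
  return parts[parts.length - 1];
}
```

The key depends only on title, page, and chunk index, so re-ingesting the same document yields the same keys, which is what lets a chat answer cite a chunk stably.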
Scientific chat prompt: system prompt updated to strict research assistant persona - cite every claim by key, prefer Results/Methods evidence over Introduction/Discussion. fetchResearchEvidence uses same-origin URL instead of hard-coded localhost:3000. Temperature taken from request instead of hard-coded 0.
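One way to derive a same-origin URL inside an API route is to rebuild it from the incoming request's headers, roughly as sketched here. This is an assumption about the approach, not a copy of fetchResearchEvidence: the function name, the use of x-forwarded-host and x-forwarded-proto, and the http fallback are all illustrative.

```typescript
// Hypothetical sketch; the PR's fetchResearchEvidence may build the URL differently.
type HeaderValue = string | string[] | undefined;

// Node-style header maps may hold arrays; take the first value.
function firstHeader(value: HeaderValue): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

function sameOriginUrl(headers: Record<string, HeaderValue>, path: string): string {
  // Prefer proxy-supplied values when present (assumption: the proxy is trusted).
  const host = firstHeader(headers["x-forwarded-host"]) ?? firstHeader(headers["host"]);
  if (!host) throw new Error("request has no Host header");
  // x-forwarded-proto can be a comma-separated list; the first entry is the client-facing one.
  const proto = (firstHeader(headers["x-forwarded-proto"]) ?? "http").split(",")[0].trim();
  return new URL(path, `${proto}://${host}`).toString();
}
```

In local development this resolves to the same localhost address as before, while behind a reverse proxy it follows the public host. Forwarded headers can be spoofed when no trusted proxy sets them, so a deployment may prefer a configured base URL.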
Validation
npx vitest run - 30/30 tests passed
npx tsc --noEmit - 0 type errors
Payout: Algora bounty-platform payout to GitHub user @watcharaponthod-code.